Novel Bounds on Marginal Probabilities 



Novel Bounds on Marginal Probabilities 



Joris M. Mooij 

MPI for Biological Cybernetics, Dept. Scholkopf 
Spemannstrafle 38, 72076 Tiibingen, Germany 



JORIS. MOOIJtyJTUEBINGEN. MPG.de 



00 

o 
o 

(N 
(N 



Hilbert J. Kappen 

Department of Biophysics 

Radboud University Nijmegen 

6525 EZ Nijmegen, The Netherlands 

Editor: 



B.KAPPENiSSCIENCE. RU.nl 



> 
a^ 
cn 

o 

oo 

o 



Abstract 

We derive two related novel bounds on single- variable marginal probability distributions in 
factor graphs with discrete variables. The first method propagates bounds over a subtree 
of the factor graph rooted in the variable, and the second method propagates bounds over 
the self-avoiding walk tree starting at the variable. By construction, both methods not only 
bound the exact marginal probability distribution of a variable, but also its approximate 
Belief Propagation marginal ("belief"). Thus, apart from providing a practical means to 
calculate bounds on marginals, our contribution also lies in an increased understanding 
of the error made by Belief Propagation. Empirically, we show that our bounds often 
outperform existing bounds in terms of accuracy and/or computation time. We also show 
that our bounds can yield nontrivial results for medical diagnosis inference problems. 
Keywords: Graphical Models, Factor Graphs, Inference, Marginal Probability Distribu- 
tions, Bounds 



1. Introduction 

Graphical models are used in many different fields. A fundame ntal problem i n the appli- 
cation of graphical models is that exact inference is NP-hard (jCooperl . Il99nl ^. In recent 
years, much research has focused on approximate inference techniques, such as sar npling 
meth ods and deterministic approximation methods, e.g., Belief Propagation (BP) ( Pearl 
I988I ). Although the approximations obtained by these methods can be very accurate, there 
are only few guarantees on the error of the approximation, and often it is not known (with- 
out comparing with the exact solution) how accurate an approximate result is. Thus it 
is desirable to calculate, in addition to the approximate results, tight bounds on the ap' 



the partition sum, e.g., (jjaakkola and Jordanl . .1996 : Wainwright et al.l . l2005l ) . can be com 



proximation error. Existiiig methods to calculate bounds on marg inals include (jTatikondal . 
20031 : iLeisink and KaDDen l '2003': ' Xaea and M^ . '2006'; 'lh lei].l200'?11. Also, up per bounds on 



bined with lower bounds o n the partition sum, such as the well-known mean field bound or 
higher-order lower bounds ( Leisink and Kappen . 200 ll ). to obtain bounds on marginals. 

In this article, we derive novel bounds on exact single- variable marginals in factor graphs. 
The original motivation for this work was to better understand and quantify the BP error. 
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This has lead to bounds which are at the same time bounds for the exact single-variable 
marginals as well as for the BP beliefs. A particularly nice feature of the bounds is that their 
computational cost is relatively low, provided that the number of possible values of each 
variable in the factor graph is small. Unfortunately, the computation time is exponential in 
the number of possible values of the variables, which limits application to factor graphs in 
which each variable has a low number of possible values. On these factor graphs however, our 
bounds perform exceedingly well and we show empirically that they outperform the state- 
of-the-art in a variety of factor graphs, including real-world problems arising in medical 
diagnosis. 

This article is organized as follows. In the next section, we derive our novel bounds. 
In Section [Sj we discuss related work. In Section |4] we present experimental results. We 
conclude with conclusions and a discussion in Section [5j 



2. Theory 

In this work, we consider graphical models such as Marko v random fields and Bay esian 
networks. We use the unifying factor graph representation ( Kschischang et al. . 200ll ). In 



the first subsection, we introduce our notation and some basic definitions concerning factor 
graphs. Then, we shortly remind the reader of some basic facts about convexity. After that, 
we introduce some notation and concepts for measures on subsets of variables. We proceed 
with a subsection that considers the interplay between convexity and the operations of 
normalization and multiplication. In the next subsection, we introduce "(smallest bounding) 
boxes" that will be used to describe sets of measures in a convenient way. Then, we formulate 
the basic lemma that will be used to obtain bounds on marginals. We illustrate the basic 
lemma with two simple examples. Then we formulate our first result, an algorithm for 
propagating boxes over a subtree of the factor graph, which results in a bound on the 
marginal of the root variable of the subtree. In the last subsection, we show how one can go 
deeper into the computation tree and derive our second result, an algorithm for propagating 
boxes over self-avoiding walk trees. The result of that algorithm is a bound on the marginal 
of the root variable (starting point) of the self-avoiding walk tree. For the special case where 
all factors in the factor graph depend on two variables at most ( "pairwise interactions" ) , our 
first result is equivalent to a truncation of the second one. This is not true for higher-order 
interactions, however. 

2.1 Factor graphs 

Let V := {1, . . . ,N} and consider N discrete random variables (xj)jgv Each variable Xi 
takes values in a discrete domain Xi. We will frequently use the following multi-index 
notation. Let A = {ii,i2, ■ ■ ■ , im} ^ V with ii < 12 < ■ ■ ■ im- We write Xa '■= -^h x '^12 ^ 
■ • • X Xi^ and for any family (li)iG_B with A C 5 C V, we write Ya ■= {Yi^,Yi^,. ..,Yi^). 

We consider a probability distribution over x = (xi, . . . ,xn) G X\; that can be written 
as a product of factors (also called "interactions") {tpf)j^jr: 

n^) = I n M^N,), ^ = E n M^^i)- (1) 
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F(Xi,Xj,Xk) = ^'lJjj{xi,Xj)lpK{Xi,Xk)lpLiXj,Xk) 

Z = ^ ^ ^ ■lljj{Xi,Xj)lpK{Xi,Xk)lljL{Xj,Xk) 
Xi^Xi Xj^Xj XkGXf^ 

Figure 1: Example of a factor graph with three variable nodes {i, j, k), represented as circles, 
and three factor nodes {J,K,L), represented as rectangles. The corresponding 
random variables ; the corresponding factors are ■0j : <Yj x Xj — > 

[0,cx)), ipK '■ >i Xk ^ [0,00) and 'i/'L : -^j x Afc — > [0, 00). The corresponding 
probability distribution P(x) is written out on the right. 




For each factor index I £ J^, there is an associated subset Nj C V of variable indices and 
the factor ipj is a nonnegative function ipj : Xi\ij — > [0,oo). For a Bayesian network, the 
factors are (conditional) probability tables. In case of Markov random fields, the factors 
are often called potentialsjl] In the following, we will use lowercase for variable indices and 
uppercase for factor indices. 

In general, the normalizing constant Z is not known and exact computation of Z is 
infeasible, due to the fact that the number of terms to be summed is exponential in N. 
Similarly, computing marginal distributions for subsets of variables ^ C V is in- 

tractable in general. In this article, we focus on the task of obtaining rigorous bounds on 
single- variable marginals P(xj) = X^^-^^^.j P(x). 

We can represent the structure of the probability distribution ([T|) using a factor graph 
{V,T,£). This is a bipartite graph, consisting of variable nodes i £V, factor nodes I & 
and edges e £ £, with an edge {i,I} between i € V and I £ T if and only if the factor 
■0/ depends on Xi (i.e., if z G Nj). We will represent factor nodes visually as rectangles 
and variable nodes as circles. Figure [U shows a simple example of a factor graph and the 
corresponding probability distribution. The set of neighbors of a factor node / is precisely 
Nj] similarly, we denote the set of neighbors of a variable node i hy Ni := {I £ : i £ Nj}. 
Further, we define for each variable i £ V the set Ai := \J Ni consisting of all variables 
that appear in some factor in which variable i participates, and the set di := Ai \ {i}, the 
Markov blanket of i. 

We will assume throughout this article that the factor graph corresponding to ([T]) is 
connected. Furthermore, we will assume that 

yi £ T yi £ Ni yXNj\{iy £ AfAr^\{i} : ^ 1pi{xi, X > 0. 

This will prevent technical problems regarding normalization later onH 

One final remark concerning notation: we will sometimes abbreviate {i} as i if no 
confusion can arise. 



1. Not to be confused with statistical physics terminology, where "potential" refers to —^logipi instead, 
with P the inverse temperature. 

2. This condition ensures that if one runs Belief Propagation on the factor graph, the messages will always 
remain nonzero, provided that the initial messages are nonzero. 
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2.2 Convexity 

Let y be a real vector space. For T elements {vt)t=i,...,T of V and T nonnegative numbers 
{\t)t=i,...,T with Ylt=i '^t ~ 1' ^^^^ Ylt=i ^t'^t ^ convex combination of the {vt)t=i,...,T 
with weights {\t)t=i,...,T- A subset X C V is called convex if for all xi,X2 € X and all 
A € [0, 1], the convex combination Axi + (1 — A)x2 € X. An extreme point of a convex set 
X is an element x G X which cannot be written as a (nontrivial) convex combination of 
two different points in X. In other words, x € X is an extreme point of X if and only if for 
all A G (0, 1) and all xi,X2 G X, x = Axi + (1 — A)x2 implies xi = X2- We denote the set 
of extreme points of a convex set X by Ext (X). For a subset Y of the vector space V, we 
define the convex hull of Y to be the smallest convex set X <Z V with y C X; we denote 
the convex hull of Y as Hull (Y). 

2.3 Measures and operators 

For ^ C V, define Ma '■= [0, oo)"^^, i.e., Ma is the set of nonnegative functions on Xa- -Ma 
can be identified with the set of finite measures on X^. We will simply call the elements of 
Ma "measures on A\ We also define M\ := Ma \ {0}. We will denote M := Uyicv-^^ 

and >I* := Uacv^A- 

Adding two measures $ G Ma results in the measure ^' + <I> in Ma- For A,BQV, 

we can multiply an element of Ma with an element of to obtain an element of Maub] 

a special case is multiplication with a scalar. Note that there is a natural embedding of Ma 

in Mb ^OT A C B C V obtained by multiplying an element ^' G Ma by 1^^^^ G Mb\a^ 

the constant function with value 1 on Xb\a- Another important operation is the partial 

summation: given A C B C V and "if € Mb, define X^^.^ ^ to be the measure in Mb\a 

that satisfies 

I X* I {xb\a) = X] '^(^a,xb\a) ^xb\a e ^B\A- 

Also, defining yl' = i? \ ^, we will sometimes write this measure as ^, which is 

an abbreviation of ^. This notation does not make explicit which variables are 

summed over (which depends on the measure that is being partially summed), although it 
shows which variables remain after summation. 

In the following, we will implicitly define operations on sets of measures by applying the 
operation on elements of these sets and taking the set of the resulting measures; e.g., if we 
have two subsets Ea C Ma and Eb ^ Mb ^ov A, B C V , we define the product of the sets 
Ea and to be the set of the products of elements of Ea and Eb, i.e., EaEb ■= {^'a^b : 
^'A G EA,'i>B € Eb}. 

In Figure [21 the simple case of a binary random variable Xj and the subset A = {i} is 
illustrated. Note that in this case, a measure \I' G Mi can be identified with a point in the 
quarter plane [0,oo) x [0,(X)). 

We will define Qa to be the set of completely factorized measures on A, i.e., 

Qa ■■= n -^{4 = S n *» • ^ -^{4 ^^^^ a G ^ I . 

aeA laeA ) 
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Xi = -1 





Mi 










1 


> 

X; — 



Figure 2: Illustration of some concepts in the simple case of a binary random variable 
Xi G Xi = {±1} and the subset A = {i}. A measure ^ G Mi can be identified 
with a point in the quarter plane as indicated in the Figure. A normalized measure 
can be obtained by scaling the result J\f^ is contained in the simplex Vi, a 
lower-dimensional submanifold of Mi- 



Note that A4a is the convex hull of Qa- Indeed, we can write each measure ^ G Ma as a 
convex combination of measures in Qa] let Z := X^^,^ ^ and note that 

= E ^ iZ5y{x)) Vrc G Xa. 
V&Xa 

For any y G Xa, the Kronecker delta function 5y G Ma (which is 1 if its argument is 
equal to y and otherwise) is an element of Qa because 5y{x) = HaeA ^Vai^aj- We denote 
Qa ■■= Qa \ {0}. 

We define the partition sum operator Z : M ^ [0, oo) which calculates the partition 
sum (normalization constant) of a measure, i.e., 

:= ^{xa) for * G Ma, ACV. 

We denote with Va the set of probability measures on A, i.e., Va = G Ma '■ Z'^ = 1}, 
and define V := Uacv"^^- "^^^ '^A is called a simplex (see also Figure [2]). Note that 
a simplex is convex; the simplex Va has precisely ^{Xa) extreme points, each of which 
corresponds to putting all probability mass on one of the possible values of xa- 

Define the normalization operator J\f : M* ^ V which normalizes a measure, i.e., 

Af^ := for ^ G M*. 

Z^ 

Note that Z o M = 1. Figure [2] illustrates the normalization of a measure in a simple case. 
2.4 Convex sets of measures 

To calculate marginals of subsets of variables in some factor graph, several operations per- 
formed on measures are relevant: normalization, taking products of measures, and summing 
over subsets of variables. In this section we study the interplay between convexity and these 
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Figure 3: Any convex combination of AA^i, AA^2 and M^s can be written as a normalized 
convex combination of , ^2 and ^3 . Vice versa, normalizing a convex combination 
of ^1:^2 and ^3 yields a convex combination oi MS,i,MS,2 and MS,3- 



operations; this will turn out to be useful later on, because our bounds make use of convex 
sets of measures that are propagated over the factor graph. 

The interplay between normalization and convexity is described by the following Lemma, 
which is illustrated in Figure [3l 

Lemma 1 Let A C V, T £ W and let {Ct)t=i,...,T be elements of A4\- Each convex 
combination of the normalized measures {^f£,t)t=i,...,T can be written as a normalized convex 
combination of the measures {(,t)t=i,...,T (which has different weights in general), and vice 
versa. 

Proof Let {Xt)t=i,...,T be nonnegative numbers with Ylt=i ~ ^- Then 

/ T \ ^ 

\t=i J t=i 

therefore 

E MM(.) = ^ f E MM(,)] = M (e ^(,) = N (e f') • 

t=i \t=i / \t=i ^* / \i=i 2^s=i zis / 

which is the result of applying the normalization operator to a convex combination of the 
elements (6)t=i,...,r- 

Vice versa, let {fJ,t)t=i,...,T be nonnegative numbers with Ylt=i l^t = 1- Then 

/ T \ T 

■V Em, =Et«. 

\t=i / t=i 

where 

(T \ T T 

E ) = E = E 

t=i / t=i t=i 
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where we defined Zt := Z^t for alH = 1, . . . , T. Thus 

(e mi) = E ^^T^e. = E ^^#^^5.. 

\t=l / t=l 2^s=l f^s^s f=l l^s=l fJ's'^s 

which is a convex combination of the normaUzed measures {Af^t)t=i,...,T- B 

The following lemma concerns the interplay between convexity and taking products; it 
says that if we take the product of convex sets of measures on different spaces, the resulting 
set is contained in the convex hull of the product of the extreme points of the convex sets. 
We have not made a picture corresponding to this lemma because the simplest nontrivial 
case would require at least four dimensions. 

Lemma 2 Let T G N* and (At)t=i,...,r be a family of mutually disjoint subsets of V . For 
each t = 1, . . . ,T, lefEfC AiAt be convex with a finite number of extreme points. Then: 



T / T \ 

CHull mExtSt , 

t=l \t=l / 



Proof Let € for each t = 1,...,T. For each t, ^'t can be written as a convex 
combination 

Yl E ^*;^t = l' V 6 € Ext (Hi) : At;j, > 0. 

etGExt(St) 6GExt(St) 

Therefore the product H^i is also a convex combination: 

fl*. = flf E 

t=i t=i UteExt(St) 



E E - E n^.* ne 

5ieExt(Si)^2GExt(S2) ^TGExt(ST) \t=i / \t=l 



T \ / T 



e Huh ^JjExtSt^ 



2.5 Boxes and smallest bounding boxes 

In this subsection, we define "(smallest bounding) boxes" , certain convex sets of measures 
that will play a central role in our bounds, and study some of their properties. 

Definition 3 Let A C V. For * G Ma and I' G Ma with * < we define the box 
between the lower bound and the upper bound ^ by 

Ba := G Ma :*<*<*}. 
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Mi 















Figure 4: The smallest bounding box B (H) for H is given by the box Bi \1') with lower 
bound and upper bound 




Xi = +1 




Figure 5: Multiplication of two boxes on the same variable set A = {i}. 



The inequalities should be interpreted pointwise, e.g., < means ^{x) < ^'(x) for all 

X £ Xyi. Note that a box is convex; indeed, its extreme points are the "corners" of which 
there are 2*'-^^\ 

Definition 4 Let A C V and E C Ma be bounded (i.e., E < for some ^ e Ma)- The 

smallest bounding box for E is defined as B (E) := Ba where ^' € Ma fl'^e given 
by 

^{xa) := inf{^'(xA) : ^' G H} Vx^ G Xa, 

^'(xa) := sup{^(xa) : ^' G H} Vxa G Xa- 

Figure m illustrates this concept. Note that B (E) = B (Hull (E)). Therefore, if E is convex, 
the smallest bounding box for E depends only on the extreme points Ext (E), i.e., B (E) = 
i3(Ext(H)). 

The product of several boxes on the same subset A of variables can be easily calculated 
as follows (see also Figure [5]) . 

Lemma 5 Let A C V, T £ W and for each t = 1,...,T, let %,^t G Ma such that 
< "^t- Then 

T / T T \ 

J{Ba {%,^t) = Ba ' 
t=l \t=l t=l / 
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i.e., the product of the boxes is again a box, with as lower bound the product of the lower 
bounds of the boxes and as upper bound the product of the upper bounds of the boxes. 

Proof We prove the case T = 2; the general case follows by induction. We show that 

That is, for $ € Ma we have to show that 

'^i{x)^2{x) <^{x) < ^i{x)'^2{x) Vx G Xa 
if and only if there exist $1, $2 € A^a such that: 

^(x) = ^i{x)^2{x) \/x e Xa; 

:^i{x) <^i{x) <^i{x) ^xeXa; 

1.2{x) < ^2{x) < ^2{x) yxeXA- 

Note that the problem "decouples" for the various possible values of re G Xa so that we 
can treat each component (indexed by x G Xa) seperately. That is, the problem reduces to 
showing that 

[a, b] ■ [c,d\ = [ac, bd] 

for < a < 6 and < c < d (take a = *.i(x), b = "^i(x), c = '^2ix) and d = '^2(2;)). In 
other words, we have to show that y G [ac, bd] if and only if there exist yi G [a, b], y2 [c, d] 
with y = yiy2. For the less trivial part of this assertion, it is easily verified that choosing 
yi and y2 according to the following table: 



Condition 


yi 


2/2 


bc<y,b>0 


6 


y 

b 


6 = 





c 


bc> y,c> 


y 

c 


c 


bc> y,c = 


b 






does the job. 



In general, the product of several boxes is not a box itself. Indeed, let G V be two 
different variable indices. Then Bi {^^, ^'j) Bj [^j, "ifj) contains only factorizing measures, 
whereas B^^jy is not a subset of Q{ij} in general. However, we do have the 

following identity: 

Lemma 6 Let T G N* and for each t = 1, . . . ,T, let Af C V and %, '^t € MAt such that 
Then 

\t=\ / \t=l t=l J 

Proof Straightforward, using the definitions. ■ 



9 



MOOIJ AND KAPPEN 



(a) 



(b) 





Figure 6: The basic lemma: the smallest bounding box enclosing the set of possible 
marginals of is identical in cases (a) and (b), if we are allowed to put ar- 
bitrary factors on the factor nodes marked with question marks. 



2.6 The basic lemma 

After defining the elementary concepts, we can proceed with the basic lemma. Given the 
definitions introduced before, the basic lemma is easy to formulate. It is illustrated in 
Figure [H 

Lemma 7 Let A,B,C C V be mutually disjoint subsets of variables. Let ^ E A^aubuc 
such that for each xc G 'Vc, 



^' > 0. 



Then: 



B\M 



C 



V \xB,Xc / / 



Proof Note that is the convex hull of Q^. Furthermore, the multiplication with ^ 
and the summation over xb,xc preserves convex combinations, as does the normalization 
operation (see Lemma [1]). Therefore, 

M I Y ^^cj ^ Hull Im I Y ^Qc ) ) 

\xB,xc / \ \xB,xc / / 

from which the lemma follows. ■ 



The positivity condition is a technical condition, which in our experience is fulfilled for most 
practically relevant factor graphs. 



2.7 Examples 

Before proceeding to the first main result, we first illustrate for a simple case how the basic 
lemma can be employed to obtain bounds on marginals. We show two bounds for the 
marginal of the variable Xi in the factor graph in Figure [7|^a) . 
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Figure 7: (a) Factor graph; (b) Illustration of the bound on P(xi) corresponding to Exam- 
ple I; (c) Cloning node j by adding a new variable j' and a factor ips{xj,Xji) = 
Sxj{xji); (d) Illustration of the improved bound on P(xj), corresponding to Ex- 
ample (II), based on (c). 



2.7.1 Example I 

First, note that the marginal of Xj satisfies 

\ Xj Xj^ j \ Xj X]^ 

because, obviously, V'L G -^{jfc}' apply^g the basic lemma with A = {«}, B 

C = {j, k} and ^ = ipjipK, we obtain 



Applying the distributive law, we conclude 
which certainly implies 



[Xi] 



This is illustrated in Figure [Tl^b), which should be read as "What can we say about the 
range of P(xi) when the factors corresponding to the nodes marked with question marks 
are arbitrary?" Because of the various occurrences of the normalization operator, we can 
restrict ourselves to normalized measures on the question- marked factor nodes: 



[Xi 
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Now it may seem that this smallest bounding box would be difficult to compute, because 
in principle one would have to compute all the measures in the sets N Ylixj V' jT^j and 
J^Ylxj./'pK'Pk- Fortunately, we only need to compute the extreme points of these sets, 
because the mapping 

Xj 

maps convex combinations into convex combinations (and similarly for the other mapping, 
involving iI^k)- Since smallest bounding boxes only depend on extreme points, we conclude 
that 

which can be calculated efficiently if the number of possible values of each variable is small. 
2.7.2 Example II 

We can improve this bound by using another trick: cloning variables. The idea is to first 
clone the variable Xj by adding a new variable xj/ that is constrained to take the same 
value as xj. In terms of the factor graph, we add a variable node j' and a factor node 5, 
connected to variable nodes j and j' , with corresponding factor ijjs{xj,Xji) := 6xj{xj'); see 
also Figure [7^c). Clearly, the marginal of satisfies: 



Xf) 



where it should be noted that in the first line, V'-L is shorthand for ^(^^{xj^Xjf) but in the 
second line it is meant as shorthand for ipiixjiyXk)- Noting that ips G '^{j j'} ^'PPlyiiig 
the basic lemma with C = yields: 

P(x,) (^^[Y.J2J2 MK^LMl^,ry I e ^-A/" ( E E E ^j^k^^lQU,,} I . 

y Xj x-i y y Xj Xji Xk j 

Applying the distributive law, we obtain (see also Figure [T^d)): 



Fix,) G SAA I I Y.'^jMX^ I I Y.'^KY.^LMy 
from which we conclude 



} 

Xfc X.I 
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This can again be calculated efficiently by considering only extreme points. 

As a more concrete example, take all variables as binary and take for each factor ip = 
(21)- Then the first bound (Example I) yields: 




whereas the second, tighter, bound (Example II) gives: 




Obviously, the exact marginal is 

2.8 Propagation of boxes over a subtree 

We now formulate a message passing algorithm that resembles Belief Propagation. However, 
instead of propagating measures, it propagates boxes (or simplices) of measures; further- 
more, it is only applied to a subtree of the factor graph, propagating boxes from the leaves 
towards a root node, instead of propagating iteratively over the whole factor graph several 
times. The resulting "belief" at the root node is a box that bounds the exact marginal 
of the root node. The choice of the subtree is arbitrary; different choices lead to different 
bounds in general. We illustrate the algorithm using the example that we have studied 
before (see Figure E]) . 

Definition 8 Let (V, E) he a factor graph. We call the bipartite graph iy, F, E) a subtree 
of (V, JT, £) with root iifi&V'^V,FCJ^,EC£ such that {V, F, E) is a tree with root 
i and for all {j, J} ^ E, j and J & F (i.e., there are no "loose edges'')^ 

An illustration of a factor graph and a possible subtree is given in Figure [8l^a)-(b). We 
denote the parent of j (z V according to {V, F, E) by par(j) and similarly, we denote the 
parent of J € -F by par( J). In the following, we will use the topology of the original factor 
graph (V, £) whenever we refer to neighbors of variables or factors. 

Each edge of the subtree will carry one message, oriented such that it "flows" towards 
the root node. In addition, we define messages entering the subtree for all "missing" edges 
in the subtree. Because of the bipartite character of the factor graph, we can distinguish 
between two types of messages: messages Bj^j C J\/[j sent to a variable j ^ V from a 
neighboring factor J G Nj., and messages I3j—,j C sent to a factor J £ F from a 
neighboring variable j £ Nj. 

The messages entering the subtree are all defined to be simplices; more precisely, we 
define the incoming messages 

Bj^j = Vj J G F, {i, J]££\E 
Bj^j = j e V, {j, J}££\E. 

3. Note that this corresponds to the notion of subtree of a bipartite graph; for a subtree of a factor graph, 
one sometimes imposes the additional constraint that for all factors J £ F, all its connecting edges {J,j} 
with j £ Nj have to be in E; here we do not impose this additional constraint. 
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(a) (b) (c) 




Figure 8: Box propagation algorithm corresponding to Example II: (a) Factor graph G = 
{V,J^,£); (b) a possible subtree {V,F,E) of G; (c) propagating sets of measures 
(boxes or simplices) on the subtree leading to a bound Bi on the marginal prob- 
ability of Xi in G. 



We propagate messages towards the root i of the tree using the following update rules (note 
the similarity with the BP update rules). The message sent from a variable j (z V to its 
parent J = par(j) G F is defined as 

Bk^j if all incoming Bk^j are boxes 

Bj^j = < K£Nj\J 

Vj if at least one of the Bx^j is the simplex Vj , 

where the product of the boxes can be calculated using Lemma EJ The message sent from 
a factor J G F to its parent k = par(J) G F is defined as 

Bj^k = BN{ n ^'-^) • (2) 

\^Nj\k lGNj\k / 

This smallest bounding box can be calculated using the following Corollary of Lemma [2j 
Corollary 9 

\^Nj\k l£Nj\k / \^JV.j\fe leNj\k 

Proof By Lemma [21 

J] Bi^ J CRnll II Ext^^^j 

leNj\k \leNj\k 
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Because the multiplication with ipj and the summation over x^^^j^ preserves convex com- 
binations, as does the normalization (see Lemma [T|), the statement follows. ■ 

The final "belief" Bi at the root node i is calculated by 



BM BK-*j I if all incoming Bk^j are boxes 

\KeN, J 

Vj if at least one of the Bk^j is the simplex Vj . 



Note that when applying this to the case illustrated in Figure [U we obtain the bound that 
we derived earlier on ("Example 11"). 

We can now formulate our first main result, which gives a rigorous bound on the exact 
single-variable marginal of the root node: 

Theorem 10 Let (V, J^, 8) he a factor graph with corresponding probability distribution 
(Cp. Let i € V and {V,F,E) be a subtree of {V,T,£) with root i (z V . Apply the "box 
propagation" algorithm described above to calculate the final "belief" Bi on the root node i. 
Then F{xi) G Bi. 

Proof sketch The first step consists in extending the subtree such that each factor node 
has the right number of neighboring variables by cloning the missing variables. The second 
step consists of applying the basic lemma where the set C consists of all the variable nodes 
of the subtree which have connecting edges in £ \E, together with all the cloned variable 
nodes. Then we apply the distributive law, which can be done because the extended subtree 
has no cycles. Finally, we relax the bound by adding additional normalizations and smallest 
bounding boxes at each factor node in the subtree. It should now be clear that the recursive 
algorithm "box propagation" described above precisely calculates the smallest bounding box 
at the root node i that corresponds to this procedure. ■ 

N ote that a subtree of th e orgin al factor graph is also a subtree of the computation tree 
for i ( Tatikonda and Jordan . 20021 ). A computation tree is an "unwrapping" of the factor 



graph that has been used in analyses of the Belief Propagation algorithm. The computation 
tree starting at variable i G V consists of all paths on the factor graph, starting at i, that 
never backtrack (see also Figure[9]^c)). This means that the bounds on the (exact) marginals 
that we just derived are at the same time bounds on the approximate Belief Propagation 
marginals (beliefs). 



Corollary 11 In the situation described in Theorem \10[ the final bounding box Bi also 
bounds the (approximate) Belief Propagation marginal of the root node i, i.e., PBp(xj) G Bi. 
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(a) 



(b) 



K' 



(c) 



K' 



J' 



© © 



Figure 9: (a) Factor graph; (b) Self-avoiding walk tree with root i, with cycle-induced leaf 
nodes shown in gray; (c) Computation tree for i. 



2.9 Bounds using Self- Avoiding Walk Trees 

While writing this article, we became aware that a rela ted rn e thod to obtain bounds on 
single- variable marginals has been proposed recently by Ihler ( 200?! ) Fl The method pre- 
sented there uses a different local bound, which empirically seems to be less tight than 
ours, but has the advantage of being computationally less demanding if the domains of the 
random variables are large. On the other hand, the bound presented there does not use 
subtrees of the factor graph, but uses self-avoiding walk (SAW) trees instead. Since each 
subtree of the factor graph is a subtree of an SAW tree, this may lead to tighter bounds. 

The idea of using a self-a voiding walk tree for calculating marginal pr obabilities seems 
to be generally attributed to Weitz ( 20061 ). but can already be found in ( Scott and Sokal . 
200i). In this subsection, we show how this idea can be combined with the propagation of 



bounding boxes. The result Theorem [13] will turn out to be an improvement over Theorem 
[To] in case there are only pairwise interactions, whereas in the general case. Theorem [10] 
often yields tighter bounds empirically. 

Definition 12 Let Q := (V,jr, <f) he a factor graph and let i £ V. A self-avoiding walk 
(SAW) starting at i of length n is a sequence a = {01,02,03, . . . , On) € (V U JF)" 
that 

(i) starts at i gV, i.e., oi = i; 



4. Note that (jlhleil . [2OO7I . Lemma 5) contains an error: to obtain the correct expression, one has to replace 
5 with 5^, i.e., the correct statement would be that 



< pU) < 



52 + (1 - S^)m{j) " " 1 - (1 - S^)m{j) 
if d(p(a;)/m(x)) < 5 (where p and m should both be normalized). 
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(ii) subsequently visits neighboring nodes in the factor graph, i.e., Oj+i £ N^^ for all 
j = 1,2, . . . ,n - 1; 

(Hi) does not backtrack, i.e., aj 7^ for all j = 1,2, ... ,n — 2; 

(iv) the first n—1 nodes are all different, i. e., aj 7^ ak if j 7^ k for j, k € {1, 2, . . . , n — 1} Jl 

The set of all self-avoiding walks starting at i £ V has a natural tree structure, defined 
by declaring each SAW {01,02, ■ ■ ■ , On, On+i) to be a child of the SAW {ai, 02, ■ ■ ■ , On), for 
all n € N*; the resulting tree is called the self-avoiding walk (SAW) tree with root i € V, 
denoted T^^^{i). 

Note that the name "self- avoiding walk tree" is slightly inaccurate, because the last 
node of a SAW may have been visited alr eady. I n gen eral, the SAW tree can be much larger 



than the original factor graph. Following llhlerj (j2007l ). we call a leaf node in the SAW tree 



a cycle-induced leaf node if it contains a cycle (i.e., if its final node has been visited before 
in the same walk), and call it a dead-end leaf node otherwise. We denote the parent of node 
a in the SAW tree by par (a) and we denote its children by ch(a). The final node of a SAW 
o = {oi, . . . ,an) is denoted by Q{a) = On- An example of a SAW tree for our running 
example factor graph is shown in Figure [9l^b). 

Let Q = (V, !F, £) be a factor graph and let i G V. We now define a propagation 



algorithm on the SAW tree Tg^'^ [i), where each node a G Tg'^'^ [i) (except for the root i) 
sends a message -BQ,^par(Q) to its parent node par(Q) G Tg {i). Each cycle- induced leaf 



node of Tg^'^ {i) sends a simplex to its parent node: if a is a cycle- induced leaf node, then 



_\Vg(o.) ifa(a)€V 

[^g(par(a)) if ^(a) G J^. 

All other nodes o in the SAW tree (i.e., the dead-end leaf nodes and the nodes with children, 
except for the root i) send a message according to the following rules. If Q{a) G V, 



t3 



Bfs^a if all Bfs^a are boxes 

< /3Gch{a) (4) 

Vg(a) if at least one of the Bp^a is a simplex. 

On the other hand, if Q{o) G J-, 

^a-^par(a) = BN { I J] ^^^^ I I ' 

\a;\6{par{c«)) \/3ech(a) / / 

The final "belief" at the root node i G V is defined as: 



B, 



BM Y\ I if all Bp^i are boxes 

\/3ech(i) / 

if at least one of the Bp^i is a simplex. 



(6) 



5. Note that (iii) almost follows from (iv), except for the condition that a„_2 7^ ctn 
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We will refer to this algorithm as "box propagation on the SAW tree"; it is similar to the 
propagation algorithm for boxes on subtrees of the factor graph that we defined earlier. 
However, note that whereas ([2]) bounds a sum-product assuming that incoming measures 
factorize, ([5]) is a looser bound that also holds if the incoming measures do not necessarily 
factorize. In the special case where the factor depends only on two variables, the updates 
([2]) and ([5]) are identical. 

Theorem 13 Let Q := {V,J^,£) be a factor graph. LetieV and let T|^^(i) be the SAW 
tree with root i. Then P(xj) £ Bi, where Bi is the bounding box that results from propagating 
bounds on the SAW tree Tg^^ {i) according to equations 

The following lemma, illustrated in Figure [TOl plays a crucial role in th e proo f of th e 
theorem. It seems to be related to the so-called "telegraph expansion" used in IWeitzl (|2006l ). 



Lemma 14 Let A, C C V he two disjoint sets of variable indices and let ^ G M.a\jc be a 
factor depending on (some of) the variables in AuC. Then: 




where 



Proof We assume that C = 0; the more general case then follows from this special case 
by replacing ^ by X^i^ ^ ' 

Let A = {ii, • • • ) in] and let ^[.j, be the lower and upper bounds corresponding to 
Bi, for all i ^ A. For each k = 1, 2, . . . , n, note that 

\Z=1 / \l=k+l / 

for all Xjj^^^^ G Therefore, we obtain from the definition of Bi^ that 

for all A; = 1, 2, . . . , 71. Taking the product of these n inequalities yields 

n n 
k=l k=l 

pointwise, and therefore M"^ G B (11^=1 ^ik)- ' 

The following corollary is somewhat elaborate to state, but readily follows from the 
previous lemma after attaching a factor / that depends on all nodes in A and one additional 
newly introduced node i: 
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Figure 10: The second basic lemma: the marginal on Xj^ in (a) is contained in the bounding 
box of the product of smallest bounding boxes Bi for i A, where (b) the 
smallest bounding box Bi is obtained by putting arbitrary factors on the other 
variables in ^4 \ {i} and calculating the smallest bounding box on i, illustrated 
here for the case i = ii. 



Corollary 15 Let Q = {V,J^,£) be a factor graph. Let z G V with exactly one neighbor in 
T , say Ni = {/}. Then P(xi) E Bi where 




and 



with 



Bi = BM\ ^i^iBi II B 

^^Nj\i \keNi\i 



(7) 



We now proceed with a sketch of the proof of Theorem [131 which was inspired by (jihleij . 



20071 ). 



Proof sketch The proof proceeds using structural induction, recursively transforming the 
original factor graph Q into the SAW tree Tg"^^{i), refining the bound at each step, until 
it becomes equivalent to the result of the message propagation algorithm on the SAW tree 
described above in equations ([Si)-®. 

Let g := iV,J^,£) be a factor graph. Let i G V and let r|^^(i) be the SAW tree with 
root i. Let {/i, . . . , /„} = N^. 

Suppose that n > 1. Consider the equivalent factor graph Q' that is obtained by creating 
n copies in of the variable node i, where each copy ij is only connected with the factor node 
Lj (for j = 1, . . . ,n); in addition, all copies are connected with the original variable i using 
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the delta function ips ■= (^(xj, Xj^, . . . , This step is ihustrated in Figure [TTIfa)-(b). 

Applying Corollary [15] to Q' yields the following bound which follows from ([7]) because of 
the properties of the delta function: 



In the expression on the right-hand side, the factor ■0/^ implicitly depends on instead of i 
(for all = 1, . . . , n). This bound is represented graphically in Figure [T]T c)-(d) where the 
gray variable nodes correspond to simplices of single- variable factors, i.e., they are meant 
to be multiplied with unknown single-variable factors. Note that ([8]) corresponds precisely 
with 




(8) 



where 




j = 1, • • • 



n. 
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(a) 



(b) 





(c) 



»ii ) »i2 I na 



n 



i"2 



*31 1 ( *32 1 ( *33 



T\ 



13 




Figure 11: One step in the proof of Theorem I13t propagating bounds towards variable i 
in case it has more than one neighboring factor nodes Ji, . . . ,/„ (here, n = 3). 
Gray nodes represent added (unknown) single-variable factors, (a) Factor graph 
G. (b) Equivalent factor graph Q' . (c) Result of replicating Q n times, where 
in each copy of ^, i is replaced by exactly n copies ikj of i for j = 1, . . . , n, 
where i^j is connected only with the factor Ij in Q}^. Then, the original variable 
i is connected using a delta factor with n of its copies ijj for j = 1, . . . , n. (d) 
Simplification of (c) obtained by identifying i with its n copies ijj for j = 1, . . . , n 
and changing the layout slightly. 
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(a) 




(b) 



, Jll 1 J12 I J13 




[ J31 } 1 J32 1 I J33 



(c) 




, J21 ) I J23 



1 J31 I J32 



Figure 12: Another step in the proof of Theorem[T3l propagating bounds towards variable i 
in case it has exactly one neighboring factor node / which has m + 1 neighboring 
variables {i,ji , ■ ■ ■ , jm}- (a) Factor graph Q. (b) Result of replicating /} m 
times and connecting the factor / with i and with copy jkk of jfc for k = 1, . . . ,m. 
(c) Equivalent to (b) but with a slightly changed layout. The gray copies of / 
represent (unknown) single- variable factors (on their neighboring variable). 
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The case that n = 1 is simpler because there is no need to introduce the delta function. 
It is illustrated in Figure [T2l Let {/} = Ni and let {ji, . . . ,jm} = ^iXi- Applying Corollary 
[TSl yields the following bound: 




eBM{ E^^^inOI (9) 



where 




V\{^,Jk} I , 

This bound is represented graphically in Figure [T2]^b)-(c) where the gray nodes correspond 
with simplices of single- variable factors. Note that ([9]) corresponds precisely with ([5]). 

Recursively iterating the factor graph operations in Figures [12] and [TTl the connected 
component that remains in the end is precisely the SAW tree Tg^^ [i)] the bounds derived 
along the way correspond precisely with the message passing algorithm on the SAW tree 
described above. ■ 



Again, the self-avoiding walk tree with root i is a subtree of the computation tree for 
i. This means that the bounds on the exact marginals given by Theorem 1131 are bounds on 
the approximate Belief Propagation marginals (beliefs) as well. 

Corollary 16 In the situation described in Theorem \1S\ the final hounding box Bi also 
hounds the (approximate) Belief Propagation marginal of the root node i, i.e., PBp(xj) E Bi. 



3. Related work 

There exist many other bounds on single-variable marginals. Also, bounds on the partition 
sum can be used to obtain bounds on single-variable marginals. For all bounds known to 
the authors, we will discuss how they compare with our bounds. In the following, we will 
denote exact marginals as Pi{xi) := F{xi) and BP marginals (beliefs) as bi{xi) := Fspixi). 



3.1 The Dobrushin-Tatikonda bound 

Tatikondal Jiool) derived a bound on the error of BP marginals using mathematical tools 
from Gibbs measure theory ( Georgii . 19881 ). in particular using a result known as Do- 
brushin's theorem. The bounds on the error of the BP marginals can be easily translated 
into bounds on the exact marginals: 



\bi{xi) - pi{xi)\ <e =^ pi{xi) G [bi{xi) - e,bi{xi) + e] 



for alH G V and Xi G X^. 

The Dobrushin-Tatikonda bound depends on the girth of the graph (the number of 
edges in the shortest cycle, or infinity if there is no cycle) and the properties of Dobrushin's 
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interdependence matrix, which is a x matrix C. The entry Cij is only nonzero if i G dj 
and in that case, the computational cost of computing its value is exponential in the size 
of the Markov blanket. Thus the computational complexity of the Dobrushin-Tatikonda 
bound is 0(maxj£v i^i'^di))^ plus the cost of running BP. 

3.2 The Dobrushin-Taga-Mase bound 

Inspired by the work of iTatikonda and Jordan (|2002l ). iTaga and Masd derived an- 

other bound on the error of BP marginals, also based on Dobrushin's theorem. This bound 
also depends on the properties of Dobrushin's interdependence matrix and has similar com- 
putational cost. Whereas the Dobrushin-Tatikonda bound gives one bound for all variables, 
the Dobrushin-Taga-Mase bound gives a different bound for each variable. 



3.3 Bound Propagation 

Leisink and Kappen ( 20031 ) proposed a method called "Bound Propagation" which can be 
used to obtain bounds on exact marginals. The idea underlying this method is very similar 
to the one employed in this work, with one crucial difference. Whereas we use a cavity 
approach, using as basis equation 



and bound the quantity P(xj) by optimizing over P^*(xgj), the basis equation employed by 
Bound Propagation is 

and the optimization is over P(xgj). Unlike in our case, the computational complexity is 
exponential in the size of the Markov blanket, because of the required calculation of the 
conditional distribution ¥{xi\xQi). On the other hand, the advantage of this approach 
is that a bound on F{xj) for j G di is also a bound on ¥{xQi), which in turn gives rise 
to a bound on P(xi). In this way, bounds can propagate through the graphical model, 
eventually yielding a new (tighter) bound on P(xgj). Although the iteration can result in 
rather tight bounds, the main disadvantage of Bound Propagation is its computational cost: 
it is exponential in the Markov blanket and often many iterations are needed for the bounds 
to become tight. Indeed, for a simple tree of = 100 variables, it can happen that Bound 
Propagation needs several minutes and still obtains very loose bounds (whereas our bounds 
give the exact marginal as lower and as upper bound, i.e., they arrive at the optimally tight 
bound). 

3.4 Upper and low^er bounds on the partition sum 

Various upper and lower bounds on the partition sum Z in ([1]) exist. An upper and a lower 
bound on Z can be combined to obtain bounds on marginals in the following way. First, 
note that the exact marginal of i satisfies 

Zi{xi) 



V{xi) 



Z 
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where we defined the partition sum of the clamped model as follows: 



Thus, we can bound 



Z, (xi 



< Pi{Xi) < 



z+ - ' - z- 

where Z~ < Z < Z~^ and Z~{xi) < Zi{xi) < Zf{xi) for all xi G Xi. 

A well-known lower bound of the partitio i i sum is the Mean Field bound. A tighter lower 
bound was derived by Leisink and Kappen ( 200ll ). An upper bound on the log partition 
sum was derived by IWainwright et alj (j2005l ) . Other lower and uppe r bounds (for the case 
of bin ary variables with pairwise interactions) have been derived bv lJaakkola and Jordan 
(|l996l l. 



4. Experiments 

We have done several experiments to compare the quality and computation time of various 
bounds empirically. For each variable in the factor graph under consideration, we calculated 
the gap for each bound Bi {^^, 3 which we define as the ^o-norm — ^^.jHq = 

maxaj.gA', \'^i{xi) - %{xi)\. 

We have used the following bounds in our comparison: 



DT: Dobrushin-Tatikonda (jTatikondal . l2003l . Proposition V.6). 
DTM: Dobrushin-Taga-Mase (|Taga and Masel . l2006l . Theorem 1). 

BoundProp: Bound Propagation ( Leisink and Kappen . 20031 ). using the implementation 
of M. Leisink, where we chose the maximum cluster size to be maxjgy #(Ai). 

BoxProp-SubT: Theorem 1101 where we used a simple breadth-first algorithm to recur- 
sively construct the subtree. 

BoxProp-SAWT: Theorem [131 where we truncated the SAW tree to at most 5000 nodes. 



Ihler-SAWT: Ihler's bound (llhled . 120071 ) . This bound has only been formulated for pair- 
wise interactions. 



Ihler-SubT: Ihler's bound (llhled . 120071 ) applied on a truncated version of the SAW tree, 
namely on the same subtree as used in BoxProp-SubT. This bound has only been 
formulated for pairwise interactions. 

In addition, we compared with appropriate combinations of the following bounds: 

MF: Mean-field lower bound. 



LK3: Third-order lower bound ([Leisink and Kappenl . [200 1[ . Eq. (10)), where we took for 
Hi the mean field solutions. This bound has been formulated only for the binary, 
pairwise case. 
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JJ: 



Refined upper bound ( Jaakkola and Jordan . 1996I . Section 2.2), with a greedy opti- 
mization over the parameters. This bound has been formulated only for the binary, 
pairwise case. 



TRW: Our implementation of ( Wainwright et al. . 20051 ). This bound has been formulated 
only for pairwise interactions. 



For reference, we calculated the Belief Propagation (BP) errors by comparing with the 
exact marginals, using the £q distance as error measure. 



4.1 Grids with binary variables 

We considered a 5 x 5 Ising grid with binary (itl-valued) variables, i.i.d. spin-glass nearest- 
neighbor interactions Jij ~ A/'(0, /3^) and i.i.d. local fields 9i ~ J\f{0,(3'^), with probability 
distribution 

^(x) = ^ exp I ^ 9iXi + ^J2J2 '^iJ^i^j 1 • 

\iev iev jedi J 

We took one random instance of the parameters J and 9 (drawn for /? = 1) and scaled 
these parameters with the interaction strength parameter /3, for which we took values in 
{10-2, 10-\ 1,10}. 

The results are shown in Figure [T3l For very weak interactions (/3 = 10^^), BoxProp- 
SAWT gave the tightest bounds of all other methods, the only exception being Bound- 
Prop, which gave a somewhat tighter bound for 5 variables out of 25. For weak and moder- 
ate interactions {(3 = 10'^, 1), BoxProp-SAWT gave the tightest bound of all methods for 
each variable. For strong interactions (/3 = 10), the results were mixed, the best methods 
being BoxProp-SAWT, BoundProp, MF-TRW and LK3-TRW. Of these, BoxProp- 
SAWT was the fastest method, whereas the methods using TRW were the slowestl^ For 
13 = 10, we present scatter plots in Figure [H] to compare the results of some methods 
in more detail. These plots illustrate that the tightness of bounds can vary widely over 
methods and variables. 

Among the methods yielding the tightest bounds, the computation time of our bounds 
is relatively low in general; only for low interaction strengths BoundProp is faster than 
BoxProp-SAWT. Furthermore, the computation time of our bounds does not depend 
on the interaction strength, in contrast with iterative methods such as BoundProp and 
MF-TRW, which need more iterations for increasing interaction strength (as the vari- 
ables become more and more correlated). Further, as expected, BoxProp-SubT needs 
less computation time than BoxProp-SAWT but also yields less tight bounds. Another 
observation is that our bounds outperform the related versions of Ihler's bounds. 



6. We had to loosen the convergence criterion for the inner loop of TRW, otherwise it would have taken 
hours. Since some of the bounds are significantly tighter than the convergence criterion we used, this 
may suggest that one can loosen the convergence criterion for TRW even more and still obtain good 
results using less computation time than the results we present here. Unfortunately, it is not clear how 
this criterion should be chosen in an optimal way without actually trying different values and using the 
best one. 
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P = 10"' 
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LK3-TRW 
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MF-TRW 
Ihler-SAWT 
BoxProp-SAWT 
Ihler-SubT 
BoxProp-SubT 
BoundProp 
DTK 
DT 
BP 
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BP - 
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gure 13: Results for grids with binary variables. The first four graphs show, for different 
values of the interaction strength /?, the gaps of bounds on marginals calcu- 
lated using different methods. Also shown for reference are the errors in the 
BP approximations to the same marginals. The final graph shows the total 
computation time for each method. 
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Bound Prop 



Figure 14: Scatter plots comparing some methods in detail for grids with binary variables 
for strong interactions {f3 = 10). 



4.2 Grids with ternary variables 

To evaluate the bounds beyond the special case of binary variables, we have performed 
experiments on a 5 x 5 grid with ternary variables and pairwise factors between nearest- 
neighbor variables on the grid. The entries of the factors were i.i.d., drawn by taking a 
random number from a normal distribution J\f (O, with mean and standard deviation 
f3 and taking the exp of that random number. 

The results are shown in Figure [T5j We have not compared with bounds involving J J 
or LK3 because these methods have only been formulated originally for the case of binary 
variables. This time, our method BoxProp-SAWT yielded the tightest bounds for all 
interaction strengths and for all variables (although this is not immediately clear from the 
plots). 



4.3 Medical diagnosis 



We a lso applied the bounds on simulated Promedas patient cases (IWemmenhove et al 



20071 ). These factor graphs have binary variables and singleton, pairwise and triple inter- 
actions (containing zeros). Two examples of such factor graphs are shown in Figure [T6l 
Because of the triple interactions, less methods were available for comparison. 

The results of various bounds for nine different, randomly generated, instances are 
shown in Figure [TTl The total number of variables for these nine instances was 1270. The 
total computation time needed for BoxProp-SubT was 51s, for BoxProp-SAWT 149 s, 
for BoundProp more than 75000 s (we aborted the method for two instances because 
convergence was very slow, which explains the missing results in the plot) and to calculate 
the Belief Propagation errors took 254 s. BoundProp gave the tightest bound for only 1 
out of 1270 variables, BoxProp-SAWT for 5 out of 1270 variables and BoxProp-SubT 
gave the tightest bound for the other 1264 variables. 

Interestingly, whereas for pairwise interactions, BoxProp-SAWT gives tighter bounds 
than BoxProp-SubT, for the factor graphs considered here, the bounds calculated by 
BoxProp-SAWT were generally less tight than those calculated by BoxProp-SubT. This 
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Figure 15: Results for grids with ternary variables. The first four graphs show, for dif- 
ferent values of the interaction strength /3, the gaps of bounds on marginals 

calculated using different methods. Also shown for reference are the errors in 
the BP approximations to the same marginals. The final graph shows the total 
computation time for each method. 
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Figure 16: Two of the Promedas factor graphs. 
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Figure 17: Results for nine different factor graphs corresponding to simulated Promedas 
patient cases. In reading order: Belief Propagation errors, BoundProp, 
BoxProp-SubT and BoxProp-SAWT. 
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is presumably due to the local bound ^ needed on the SAW tree, which is quite loose 
compared with the local bound ^ that assumes independent incoming bounds. 

Not only does BoxProp-SubT give the tightest bounds for almost all variables, it is 
also the fastest method. Finally, note that the tightness of these bounds still varies widely 
depending on the instance and on the variable of interest. 

5. Conclusion and discussion 

We have derived two related novel bounds on exact single-variable marginals. Both bounds 
also bound the approximate Belief Propagation marginals. The bounds are calculated by 
propagating convex sets of measures over a subtree of the computation tree, with update 
equations resembling those of BP. For variables with a limited number of possible values, the 
bounds can be computed efficiently. Empirically, our bounds often outperform the existing 
state-of-the-art in that case. Although we have only shown results for factor graphs for 
which exact inference was still tractable (in order to be able to calculate the BP error), 
we would like to stress here that it is not difficult to construct factor graphs for which 
exact inference is no longer tractable but the bounds can still be calculated efficiently. An 
example are large Ising grids (of size m x m with m larger than 30). Indeed, for binary 
Ising grids, the computation time of the bounds (for all variables in the network) scales 
linearly in the number of variables, assuming that we truncate the subtrees and SAW trees 
to a fixed maximum size. 

Whereas the results of different approximate inference methods usually cannot be com- 
bined in order to get a better estimate of marginal probabilities, for bounds one can combine 
different methods simply by taking the tightest bound or the intersection of the bounds. 
Thus it is generally a good thing to have different bounds with different properties (such as 
tightness and computation time). 

An advantage of our methods BoxProp-SubT and BoxProp-SAWT over iterative 
methods like BoundProp and MF-TRW is that the computation time of the iterative 
methods is difficult to predict (since it depends on the number of iterations needed to 
converge which is generally not known a priori). In contrast, the computation time needed 
for our bounds BoxProp-SubT and BoxProp-SAWT only depends on the structure of 
the factor graph (and the chosen subtree) and is independent of the values of the interactions. 
Furthermore, by truncating the tree one can trade some tightness for computation time. 

By far the slowest methods turned out to be those combining the upper bound TRW 
with a lower bound on the partition sum. The problem here is that TRW usually needs 
many iterations to converge, especially for stronger interactions where convergence rate 
can go down significantly. In order to prevent exceedingly long computations, we had to 
hand-tune the convergence criterion of TRW according to the case at hand. 

BoundProp can compete in certain cases with the bounds derived here, but more of- 
ten than not it turned out to be rather slow or did not yield very tight bounds. Although 
BoundProp also propagates bounding boxes over measures, it does this in a slightly differ- 
ent way which does not exploit independence as much as our bounds. On the other hand, 
it can propagate bounding boxes several times, refining the bounds more and more each 
iteration. 
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Regarding the related bounds BoxProp-SubT, BoxProp-SAWT and Ihler-SAWT 
we can draw the following conclusions. For pairwise interactions and variables that have not 
too many possible values, BoxProp-SAWT is the method of choice, yielding the tightest 
bounds without needing too much computation time. The bounds are more accurate than 
the bounds produced by Ihler-SAWT due to the more precise local bound that is used; 
the difference is largest for strong interactions. However, the computation time of this 
more precise local bound is exponential in the number of possible values of the variables, 
whereas the local bound used in Ihler-SAWT is only polynomial in the number of possible 
values of the variables. Therefore, if this number is large, BoxProp-SAWT may be no 
longer applicable in practice, whereas Ihler-SAWT still may be applicable. If factors are 
present that depend on more than two variables, it seems that BoxProp-SubT is the best 
method to obtain tight bounds, especially if the interactions are strong. Note that it is 
not immediately obvious how to extend Ihler-SAWT beyond pairwise interactions, so we 
could not compare with that method in that case. 

This work also raises some new questions and opportunities for future work. First, 
the bounds can be used to general ize the improved c o nditio ns for convergence of Belief 
Propagation that were derived in ( Mooij and Kappen . 200?! ) beyond the special case of 
binary variables with pairwise interactions. Second, it may be possible to combine the 
various ingredients in BoundProp, BoxProp-SubT and BoxProp-SAWT in novel ways 
in order to obtain even better bounds. Third, it is an interesting open question whether the 
bounds can be extended to continuous variables in some way. Finally, although our bounds 
are a step forward in quantifying the error of Belief Propagation, the actual error made by 
BP is often at least one order of magnitude lower than the tightness of these bounds. This 
is due to the fact that (loopy) BP cycles information through loops in the factor graph; this 
cycling apparently improves the results. The interesting and still unanswered question is 
why it makes sense to cycle information in this way and whether this error reduction effect 
can be quantified. 
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